Biomedical ontology MeSH improves document clustering qualify on MEDLINE articles: A comparison study
19th IEEE International Symposium on Computer-Based Medical Systems, CBMS 2006, Salt Lake City, UT
Document clustering has been used for better document
retrieval, document browsing, and text mining. In this paper,
we investigate if biomedical ontology MeSH improves the
clustering quality for MEDLINE articles. For this
investigation, we perform a comprehensive comparison study
of various document clustering approaches such as
hierarchical clustering methods (single-link, complete-link,
and average-link), Bisecting K-means, K-means, and Suffix
Tree Clustering (STC) in terms of efficiency, effectiveness,
and scalability. According to our experimental results,
biomedical ontology MeSH significantly enhances clustering
quality on biomedical documents. In addition, our results
show that decent document clustering approaches, such as
Bisecting K-means, K-means, and STC, gain some benefit
from the MeSH ontology, while hierarchical algorithms, which show
the poorest clustering quality, do not reap the benefit of
the MeSH ontology.
A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE
Presented at the 2006 ACM/IEEE Joint Conference on Digital Library (JCDL 2006), June 11-15, 2006, Chapel Hill, NC, USA. Retrieved 6/26/2006 from http://www.ischool.drexel.edu/faculty/thu/My%20Publication/Conference-papers/JCDL06.pdf.
Document clustering has been used for better document retrieval,
document browsing, and text mining in digital libraries. In this
paper, we perform a comprehensive comparison study of various
document clustering approaches such as three hierarchical
methods (single-link, complete-link, and average-link), Bisecting
K-means, K-means, and Suffix Tree Clustering in terms of the
efficiency, the effectiveness, and the scalability. In addition, we
apply a domain ontology to document clustering to investigate whether
an ontology such as MeSH improves clustering quality for
MEDLINE articles. Because an ontology is a formal, explicit
specification of a shared conceptualization for a domain of
interest, the use of ontologies is a natural way to solve traditional
information retrieval problems such as synonym/hypernym/
hyponym problems. We conducted fairly extensive experiments
based on different evaluation metrics such as misclassification
index, F-measure, cluster purity, and entropy on very large article
sets from MEDLINE, the largest biomedical digital
library.
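Two of the evaluation metrics named above, cluster purity and entropy, can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' evaluation code; the function names and the toy labels are assumptions made for the example.

```python
import math
from collections import Counter


def _group_by_cluster(labels_true, labels_pred):
    """Collect the true class labels of the documents in each predicted cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, []).append(true_label)
    return clusters


def cluster_purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = _group_by_cluster(labels_true, labels_pred)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def cluster_entropy(labels_true, labels_pred):
    """Size-weighted average of per-cluster class entropy in bits (lower is better)."""
    clusters = _group_by_cluster(labels_true, labels_pred)
    total = len(labels_true)
    result = 0.0
    for members in clusters.values():
        n = len(members)
        h = -sum((c / n) * math.log2(c / n) for c in Counter(members).values())
        result += (n / total) * h
    return result


# Toy data (hypothetical): four documents, two true classes.
true_classes = ["a", "a", "b", "b"]
print(cluster_purity(true_classes, [0, 0, 0, 1]))
print(cluster_entropy(true_classes, [0, 0, 0, 0]))
```

A perfect clustering gives purity 1 and entropy 0, while placing every document in a single cluster drives purity down to the share of the largest class.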
A coherent biomedical literature clustering and summarization approach through ontology-enriched graphical representations
Data Warehousing and Knowledge Discovery, Proceedings 4081, pp. 374-383, DOI: http://dx.doi.org/10.1007/11823728
In this paper, we introduce a coherent biomedical literature clustering
and summarization approach that employs a graphical representation method
for text using a biomedical ontology. The key to the approach is to construct
document cluster models as semantic chunks capturing the core semantic
relationships in the ontology-enriched scale-free graphical representation of
documents. These document cluster models are used for both document
clustering and text summarization by constructing Text Semantic Interaction
Network (TSIN). Our extensive experimental results indicate our approach
shows 45% cluster quality improvement and 72% clustering reliability
improvement, in terms of misclassification index, over Bisecting K-means as a
leading document clustering approach. In addition, our approach provides
a concise but rich text summary in key concepts and sentences. The primary
contribution of this paper is that we introduce a coherent biomedical literature
clustering and summarization approach that takes advantage of ontology-enriched
graphical representations. Our approach significantly improves the
quality of document clusters and understandability of documents through
summaries.
A semantic approach for mining hidden links from complementary and non-interactive biomedical literature
Presented at the 2006 SIAM Conference on Data Mining (SIAM DM 2006). Retrieved 6/26/2006 from http://www.ischool.drexel.edu/faculty/thu/My%20Publication/Conference-papers/SIAM06-Hu.pdf.
Two complementary and non-interactive literature sets
of articles, when they are considered together, can
reveal useful information of scientific interest not
apparent in either of the two sets alone. Swanson
called such hidden links
undiscovered public knowledge (UPK). The novel
connection between Raynaud disease and fish oils was
uncovered from complementary and non-interactive
biomedical literature by Swanson in 1986. Since then,
there have been many approaches to uncover UPK by
mining the biomedical literature. These earlier works,
however, required substantial manual intervention to
reduce the number of possible connections. This paper
proposes a semantic-based mining model for
undiscovered public knowledge using the biomedical
literature. Our method replaces manual ad-hoc
pruning by using semantic knowledge from the
biomedical ontologies. Using the semantic types and
semantic relationships of the biomedical concepts, our
prototype system can identify the relevant concepts
collected from Medline and generate the novel
hypothesis between these concepts. The system
successfully replicates Swanson's two famous
discoveries: Raynaud disease/fish oils and
migraine/magnesium. Compared with previous
approaches such as LSI-based and traditional
association rule-based methods, our method generates
far fewer but more relevant novel hypotheses, and
requires much less human intervention in the
discovery procedure.
Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule
Paper accepted for publication in Journal of Information Systems. Retrieved 6/26/2006 from http://www.ischool.drexel.edu/faculty/thu/My%20Publication/Journal-papers/JIS_hu2006.pdf.
The novel connection between Raynaud disease and fish oils was
uncovered from two disjointed biomedical literature sets by Swanson in 1986.
Since then, there have been many approaches to uncover novel connections
by mining the biomedical literature. One of the popular approaches is to adapt
the Association Rule (AR) method to automatically identify implicit novel
connections between concept A and concept C from two disjointed sets of
documents through an intermediate concept B. Since the A and C concepts do not
occur together in the same data set, the mining goal is to find novel connections
between A and C concepts in the disjoint data sets. It first applies association rules
to the two disjointed biomedical literature sets separately to generate two rule
sets (A → B, B → C), and then applies the transitive law to get the novel connections
A → C. However, this approach generates a huge number of possible
connections among the millions of biomedical concepts and a lot of these
hypothetical connections are spurious, useless and/or biologically meaningless.
Thus it is essential to develop a new approach to generate highly likely novel and
biologically relevant connections among the biomedical concepts. This paper
presents a Biomedical Semantic-based Association Rule System (Bio-SARS)
that significantly reduces spurious/useless/biologically irrelevant connections
through semantic filtering. Compared to other approaches such as LSI and
traditional association rule-based approaches, our approach generates far fewer
rules, and many of these rules represent relevant connections among biological
concepts.
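The transitive step described in this abstract (combine A → B and B → C rules into candidate A → C connections, then prune by semantic filtering) can be sketched as below. This is only an illustration under stated assumptions: the function name, the rule representation as concept pairs, the rule that both B and C must carry an allowed semantic type, and all toy concepts and type labels are hypothetical and are not taken from Bio-SARS.

```python
def transitive_links(rules_ab, rules_bc, semantic_type, allowed_types):
    """Combine A->B and B->C rules into candidate A->C links.

    A link is kept only if both the intermediate concept B and the target
    concept C have a semantic type in `allowed_types` (an assumed, simplified
    stand-in for semantic filtering). Returns {(A, C): set of bridging Bs}.
    """
    # Index the B->C rules by their B concept for fast lookup.
    targets_by_b = {}
    for b, c in rules_bc:
        targets_by_b.setdefault(b, set()).add(c)

    links = {}
    for a, b in rules_ab:
        if semantic_type.get(b) not in allowed_types:
            continue  # prune semantically irrelevant intermediates
        for c in targets_by_b.get(b, ()):
            if c == a or semantic_type.get(c) not in allowed_types:
                continue  # skip trivial loops and irrelevant targets
            links.setdefault((a, c), set()).add(b)
    return links


# Toy rules and semantic types (hypothetical, for illustration only).
rules_ab = [("Raynaud disease", "blood viscosity"), ("Raynaud disease", "hospital")]
rules_bc = [("blood viscosity", "fish oils"), ("hospital", "nurse")]
semantic_type = {
    "blood viscosity": "Physiologic Function",
    "fish oils": "Lipid",
    "hospital": "Organization",
    "nurse": "Occupation",
}
allowed = {"Physiologic Function", "Lipid"}

print(transitive_links(rules_ab, rules_bc, semantic_type, allowed))
```

Without the type check, the toy input would also yield the meaningless link from Raynaud disease to nurse, which is the kind of spurious connection the abstract says semantic filtering is meant to remove.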
Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS
Background: Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without being integrated with the keyword queries, and the users have to provide a large number of training documents to get a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multi-level relevance feedback in real time on PubMed.
Results: RefMed supports multi-level relevance feedback by using the RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into an RDBMS to support both keyword queries and the multi-level relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed.
Conclusions: RefMed is the first multi-level relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user's feedback and efficiently processes the function to return relevant articles in real time.
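One way to see why multi-level feedback can help a pairwise ranking learner such as the RankSVM is to count the preference pairs it yields. The sketch below is a generic illustration of turning graded feedback into pairwise preferences; it is not RefMed's implementation, and the function name, document ids, and relevance levels are hypothetical.

```python
from itertools import combinations


def preference_pairs(feedback):
    """Turn graded feedback into (preferred_doc, other_doc) training pairs.

    `feedback` maps a document id to a relevance level, where a higher level
    means more relevant. Documents with equal levels produce no pair.
    """
    pairs = []
    for (doc1, level1), (doc2, level2) in combinations(feedback.items(), 2):
        if level1 > level2:
            pairs.append((doc1, doc2))
        elif level2 > level1:
            pairs.append((doc2, doc1))
    return pairs


# Same three documents judged on a binary scale and on a three-level scale.
binary_feedback = {"d1": 1, "d2": 1, "d3": 0}
graded_feedback = {"d1": 2, "d2": 1, "d3": 0}

print(len(preference_pairs(binary_feedback)))
print(len(preference_pairs(graded_feedback)))
```

With the same three judged documents, the graded scale separates d1 from d2 and so contributes an extra ordering constraint that the binary scale cannot express.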
ACKNOWLEDGEMENTS
I am indebted to many people for their support and advice toward the successful completion of my Ph.D. degree and this dissertation. My deepest gratitude goes to my supervisor, Dr. Xiaohua Hu, for his guidance and assistance with this dissertation as well as all the research during my doctoral research endeavor for the past four years. He has helped me to move forward with in-depth investigation and to remain focused on achieving my goal. I am grateful to my committee members, Dr. Il-Yeol Song, Dr. Xia Lin, Dr. Bahrad A. Sokhansanj, and Dr. Don Goelman, for their invaluable advice and suggestions. In particular, Dr. Song has always been meticulous in proofreading my research papers. His advice on both academic and non-academic matters has been inestimable. I would like to express my appreciation to my parents, SungTae Yoo and SunJa Park, and to my parents-in-law, TaeWhan Jung and SoonAe Goo, for their love, support and encouragement. I would like to express my sincere thanks to my wife YoungJae Jung for her love and sacrifice. Without her constant sacrifice, this thesis would not have bee